Extracting k most important groups from data efficiently

Authors

  • Man Lung Yiu
  • Nikos Mamoulis
  • Vagelis Hristidis
Abstract

We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of such a query is “find the 10 combinations of product-type and month with the largest sum of sales”. The problem is challenging because the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing under limited memory. In particular, we design top-k groups retrieval techniques for three representative scenarios. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), which exploits available memory for efficient top-k groups computation. For the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Finally, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of the group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
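To make the query in the abstract concrete, here is a minimal Python sketch of a top-k groups computation with SUM as the aggregate. It is a naive in-memory baseline, not WMSA, RHA, or CGA: it assumes every group total fits in memory, which is exactly the assumption the paper's algorithms are designed to avoid. The function name `top_k_groups`, the `(group_key, measure)` row layout, and the sample sales figures are all hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict


def top_k_groups(rows, k):
    """Return the k groups with the largest SUM(measure).

    rows: iterable of (group_key, measure) pairs (hypothetical layout).
    Naive baseline: keeps one running total per group in memory.
    """
    totals = defaultdict(float)
    for group_key, measure in rows:
        totals[group_key] += measure  # aggregate as each tuple is read
    # Pick the k largest totals without sorting every group.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Illustrative data: group key is (product_type, month).
sales = [
    (("laptop", "2008-01"), 1200.0),
    (("phone", "2008-01"), 800.0),
    (("laptop", "2008-02"), 500.0),
    (("phone", "2008-01"), 600.0),
    (("laptop", "2008-01"), 300.0),
]

top2 = top_k_groups(sales, 2)
print(top2)
```

The dictionary of totals grows with the number of distinct groups, so this approach breaks down once the groups outnumber what memory can hold; that is the regime the abstract targets.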

Similar resources

Designing a smart algorithm for determining stock exchange signals by data mining

One of the most important problems in modern finance is finding efficient ways to summarize and visualize the stock exchange market. This research proposes a smart algorithm by means of valuable big data that is generated by stock exchange market and different kinds of methodology to present a smart model. In this paper, we investigate relationships between the data and access to their lat...

Fuzzy clustering of time series data: A particle swarm optimization approach

With rapid development in information-gathering technologies and access to large amounts of data, we always require methods for analyzing data and extracting useful information from large raw datasets, and data mining is an important method for solving this problem. Clustering analysis, as the most commonly used function of data mining, has attracted many researchers in computer science. Because o...

Presenting Governance Models: a Case Study for Scientific Groups of National Elite Foundation

One of the most important components in the university teaching-learning context is the relationship between teacher and student. The formation of scientific groups is a new initiative being run by the National Elite Foundation. At this time, one of the most important issues for the National Elite Foundation is the recognition of administrative requirements for the strengthening and expansion of ...

Homework 3: Wikipedia Clustering

Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based on their features. An ideal clustering algorithm maximizes feature similarities within a cluster while minimizing the feature similarities across clusters. Some of the most common clustering algorithms include spectral clustering and k-means clustering. This project essential...

EEG Based Brain Computer Interface Hand Grasp Control: Feature Extraction Method MTCSP

Brain-Computer Interfaces (BCIs) are communication systems, which enable users to send commands to computers by using brain activity only; this activity being generally measured by Electroencephalography (EEG). BCIs are generally designed according to a pattern recognition approach, i.e., by extracting features from EEG signals, and by using a classifier to identify the user’s mental state from...

Journal title:
  • Data Knowl. Eng.

Volume 66, Issue

Pages -

Publication date 2008